Summary

In bringing up an old operating system on a simulator, the assumption must be 
that any problem is the simulator's fault; after all, the operating system worked on 
real hardware. This assumption has not always proved to be true. Simulators on 
modern PC's are often significantly faster than real hardware and thus may 
expose race conditions or timing bugs. Simulators may exercise code paths that, 
in late stage operating systems, were no longer used, such as full installs. 
Simulators may create configurations that were not practical, due to physical or 
financial limitations. Finally, simulators may present late stage operating 
systems with hardware configurations that, while nominally supported, could in 
practice no longer be tested. 

Timing Problems 

On modern PC's, simulators for a computer architecture are often significantly 
faster than any real hardware that was ever built. PDP-10 simulators, for 
example, have been clocked at over 10 mips; the fastest DEC PDP-10 (the 
KL10) was 1.5 mips. Simulated devices are often much faster than their real 
counterparts. These speed changes can expose timing dependencies in 
operating system code. 

A trivial case is a timing loop. Some software environments, such as console- or 
microcomputer-based games, are very dependent on timing loops. Timing loops 
also occur in system bootstraps. For example, the VAX KA655 boot code uses 
delay loops executing directly from boot ROM to run "slowly" as a wait loop on 
clock ticks. Finally, timing loops show up frequently in diagnostics. 
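
The effect of CPU speed on a delay loop is simple arithmetic. The following is a minimal sketch in Python; the function name and the loop size are hypothetical, and the two speeds are merely the KL10 and simulator figures quoted above, used as illustrative inputs:

```python
def delay_loop_seconds(iterations, instructions_per_iteration, mips):
    """Wall-clock time consumed by a software delay loop at a given CPU speed."""
    instructions = iterations * instructions_per_iteration
    return instructions / (mips * 1_000_000)


# Hypothetical loop: 15,000 iterations of a 2-instruction decrement-and-branch.
tuned_for = delay_loop_seconds(15_000, 2, mips=1.5)    # speed the loop was written for
simulated = delay_loop_seconds(15_000, 2, mips=10.0)   # faster simulated CPU
```

A loop sized to wait a fixed interval on the slower machine finishes in a fraction of that interval on the faster one, so whatever it was waiting for may not have happened yet.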

More subtle problems occur around interrupts. Operating system code often 
assumes that a large amount of time elapses between initiation of an I/O 
operation and receipt of the completion interrupt. If the interrupt arrives "too soon", it 
may be misinterpreted or lost. All versions of the RSX11M+ MSCP driver prior to 
V4.5 had this problem in the initialization sequence. The VAX NetBSD driver has this 
problem during normal operations, as Kevin Handy documented in this note to 
the author: 

Starting at '1->', we set up a mscp packet to put the drive online. At '2->' we ping the 
mscp controller to take a look at it's packets. And '3->' waits up to (100*100) time units 
for the controller to respond with an interrupt. 

The problem, is that by the time it gets to '3->', the interrupt has already occurred and 
been processed. It's waiting for an interrupt that has already occurred, thus the timeout 
fails. You can see it by single stepping through the code (it suddenly jumps out of the 
sequence, putters around for a while, then jumps back in). 



The CPU is expecting to have enough time to set up a timeout routine before it will get a 
response back. It's not expecting an instant response back. You need to delay the 
responses from your emulated controllers for <mumble> instructions/microseconds, and 
then you will then get past this problem. 

Ken Harrenstein found a similar problem in the disk driver in ITS. 

To address these issues, the SIMH MSCP simulator simulates 'delays' between 
initialization steps, and between initiation of an operation and completion. The 
delays have to be tuned experimentally to get the right values. For example, M+ 
requires at least 200 instructions between initialization steps, but RSTS/E can 
tolerate virtually no delay after the completion of step 4. 
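
The lost-interrupt pattern, and the effect of a simulated completion delay, can be sketched with a toy event queue. This is an illustrative model in Python, not SIMH or NetBSD code; the class, the function, and the 50-instruction setup time are all assumptions made for the example:

```python
import heapq
import itertools


class Machine:
    """Toy instruction-counting machine with a queue of pending device events."""

    def __init__(self):
        self.now = 0                      # instructions executed so far
        self._events = []                 # heap of (due_time, seq, callback)
        self._seq = itertools.count()     # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        heapq.heappush(self._events, (self.now + delay, next(self._seq), callback))

    def step(self, instructions=1):
        """Execute instructions one at a time, firing events as they fall due."""
        for _ in range(instructions):
            self.now += 1
            while self._events and self._events[0][0] <= self.now:
                _, _, callback = heapq.heappop(self._events)
                callback()


def bring_unit_online(completion_delay, setup_instructions=50, timeout=10_000):
    """Return True if the driver sees the completion interrupt, False on timeout."""
    machine = Machine()
    state = {"waiting": False, "done": False}

    def interrupt_handler():
        # The completion only registers if the driver has already armed its wait.
        if state["waiting"]:
            state["done"] = True

    machine.schedule(completion_delay, interrupt_handler)  # poke the controller
    machine.step(setup_instructions)                       # driver sets up its timeout
    state["waiting"] = True                                # driver starts waiting
    for _ in range(timeout):
        if state["done"]:
            return True
        machine.step()
    return state["done"]
```

With these assumed numbers, `bring_unit_online(0)` times out because the interrupt is dismissed before the driver is waiting for it, while `bring_unit_online(200)` succeeds. The delay that is "long enough" depends entirely on the driver's code path, which is why it has to be found by experiment.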

Finally, the changed timing of the simulation environment may expose race 
conditions and bugs that have lain dormant in the code. RSX11M+ has a 
compound bug in which a coding error in MSCP device initialization is masked, 
on real hardware, by the outcome of a timing race condition. If the boot device is 
an MSCP disk, M+ routine RVEC brings up first the controller (routine $KRBSC) 
and then the boot disk (routine $UCBSC) by issuing three MSCP commands: 

- Set controller characteristics 

- Unit online 

- Get unit status 

There is a bug in the MSCP driver's handling of the get unit status command. In 
the interrupt handler for command completion, routine RQRCT destroys the 
success status code and overwrites it with 0310 (bad block replacement 
needed). If the MSCP disk is 'fast', or the driver code paths are really long, the 
get unit status command completes before control returns to $UCBSC. $UCBSC 
sees an error status and marks the disk as offline, causing the bootstrap to fail. 
This is what happens on the simulator with M+ V3.0. 

On the other hand, if the MSCP disk is slow, or the driver code paths tighter, 
control returns to $UCBSC while the get unit status is still in progress. The 
status code it sees is from the successful unit online, and $UCBSC marks the boot disk 
as online and returns. This is what happens on the simulator with M+ V4.0 or 
later, and, apparently, with real hardware. 

Even with the timing race falling the 'right' way, it requires another bug to prevent 
routine RVEC from seeing the erroneous status code from the get unit status. 
When $UCBSC returns, RVEC sees that the unit online sequence is not 
complete and waits for the get unit status to set a final status code. When that 
status (the erroneous 0310) is set, it is ignored. RVEC only checks to see 
whether the disk is online. And the disk is online, because $UCBSC set status 
from the unit online command, rather than the get unit status command. 



Interestingly, when the bug in RQRCT was addressed in M+ V4.0, the fix was 
incorrect, and the code continued to work only because of the timing race 
condition. 

To get around this race condition, the SIMH MSCP simulator command 
completion delay must be tuned experimentally. M+ V3.0 requires at least 175 
instructions between initiation and completion of a command. 
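
The interaction of the two bugs can be reduced to a few lines. The sketch below is a simplified Python model, not the M+ code: the `SUCCESS` value is a placeholder, the function name is hypothetical, and the 175-instruction return path is taken from the tuning figure above as an assumed threshold:

```python
SUCCESS = 1                  # placeholder value for a successful status code
BAD_BLOCK_REPLACE = 0o310    # the status RQRCT erroneously stores


def boot_disk_online(completion_delay, return_path_length=175):
    """Model whether the boot disk ends up marked online.

    completion_delay:   instructions from issuing get unit status to its completion.
    return_path_length: instructions until $UCBSC examines the status (assumed).
    """
    status = SUCCESS                          # left by the successful unit online
    if completion_delay < return_path_length:
        status = BAD_BLOCK_REPLACE            # RQRCT runs first and clobbers it
    online = status == SUCCESS                # $UCBSC decides on what it sees
    # The get unit status completes eventually either way and sets 0310,
    # but RVEC only checks the online flag, so the late status is ignored.
    status = BAD_BLOCK_REPLACE
    return online
```

In this model a fast controller (`boot_disk_online(0)`) fails the boot, and any delay of at least the return path length succeeds, even though the final status code is the erroneous 0310 in both cases.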

Unused Paths 

Simulator users routinely perform full installations of operating systems onto 
empty disks; indeed, a full installation is one of the litmus tests for simulator 
success. But in real life, this path might no longer be used or tested. DEC 
ceased production on DECsystem-10's in the early 80's but continued to update 
TOPS-10 through 1988. When the last release (7.04) came out, there were no 
new DECsystem-10's requiring full installs, and the code path was insufficiently 
tested. And, in fact, it contains a bug. This problem burned Tim Stark during 
debug of TS10, as documented in this note to KLH10 author Ken Harrenstein: 

There's also a bug that interferes with TOPS-10 7.04 from being built correctly from 
scratch; that was presumably not found because no one was doing clean installs in 1988. 
It has to do with enumerating magtape channels or units; the code's counting loop 
overflows from the MTCS2 formatter select field into the Unibus address inhibit, so that 
the next magtape read doesn't work. SIMH got away with it because I didn't implement 
address inhibit, but Tim Stark got burned in TS10 because he did. (He thought the driver 
required it.) 

Because SIMH doesn't accurately follow the hardware, it is, ironically, immune 
to this problem. 

A more complex case is a magtape boot bug in TOPS-20 V4.1 for the KS10. 
The magtape bootstrap is read into low memory and then relocated to high 
memory for execution. For some reason, the move is done with EXCH 
instructions rather than conventional moves, thus replacing the low core image 
with the contents of high memory. The bootstrap contains the instruction 
WRCSTM [77B5]. After relocation of the bootstrap, the WRCSTM's address is 
still pointing to low core, which has been overwritten. The WRCSTM writes 
garbage to the CSTM, and the boot fails, as documented in this note in 
alt.sys.pdp10: 

The tape bootstrap moves itself into high memory with a routine that exchanges memory 
locations, rather than copies them. (I have no idea why.) The WRCSTM instruction in 
the boot references absolute address 40127, but that's been copied to high memory, and 
garbage (zero for the simulator) exchanged into its place. When paging is turned on, the 
simulator gets an age page fail error, because the CSTM is all 0's, and the age bit gets 
zeroed on the second page fill. Ugh. If I run the boot again, in the same core image, it 
works, because the contents of 40127 are already in high memory and are brought back 
to the right spot by the exchange. 



How could such an obvious problem have been overlooked? One suggestion - that 
the tape bootstrap of V4.1 had simply not been tested on the KS10 - was 
indignantly rejected by veterans of the TOPS-20 group. They insisted that the 
code worked on a real KS10 CPU but could not explain how. 

The answer, perhaps, lies in the observation that the bootstrap succeeds the 
second time, because the exchange moves a copy of the bootstrap back to low 
memory, and the WRCSTM retrieves the correct data. On a real KS10, the 
front-end console had a watchdog timer. If the main CPU failed to respond with a 
heartbeat in a given amount of time, the console would reboot the system - 
without disturbing memory. The second bootstrap would succeed. From the 
viewpoint of anyone debugging the bootstrap process on real hardware, there 
would be a small tape movement, a delay, a backspace, and then a normal boot. 
If the tape motion wasn't noticed, the delay could be ascribed to self-test 
procedures in the front-end console or other "normal" delays. The system did 
boot; there was no need to look deeper. 
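
The fail-then-succeed behavior follows directly from relocating by exchange rather than by copy. A minimal Python sketch, assuming a tiny stand-in bootstrap image; the sizes, offsets, and names are hypothetical, and only the 77B5 literal comes from the text above:

```python
BOOT_WORDS = 8                  # hypothetical bootstrap length, in words
LITERAL_OFFSET = 3              # hypothetical position of the [77B5] literal
LITERAL_VALUE = 0o77 << 30      # 77B5: octal 77 ending at bit 5 of a 36-bit word


def tape_image():
    """Stand-in bootstrap image: arbitrary nonzero words plus the literal."""
    image = [0o1000 + i for i in range(BOOT_WORDS)]
    image[LITERAL_OFFSET] = LITERAL_VALUE
    return image


class Core:
    """Memory that survives across boot attempts, as with a watchdog reboot."""

    def __init__(self):
        self.low = [0] * BOOT_WORDS    # region the tape is read into
        self.high = [0] * BOOT_WORDS   # region the bootstrap runs from


def boot_attempt(core):
    """One boot attempt; returns the word WRCSTM fetches from low core."""
    core.low = tape_image()                        # tape read lands in low memory
    for i in range(BOOT_WORDS):                    # EXCH-style relocation: swap words
        core.low[i], core.high[i] = core.high[i], core.low[i]
    return core.low[LITERAL_OFFSET]                # WRCSTM still points at low core
```

On a fresh `Core()`, the first `boot_attempt` returns 0 (the garbage exchanged into low core), and a second attempt on the same object returns `LITERAL_VALUE`, because the first attempt's copy in high memory is exchanged back into place.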

Impractical Configurations 

In today's computers with megabytes of memory and gigabytes of storage, the 
largest configuration of a historic computer represents a tiny fraction of the 
available resources. Simulators can create configurations that for physical or 
financial reasons were impractical with real hardware. For example, the SIMH 
PDP-15 simulator supports an RF15 fixed head disk controller with up to 8 RS09 
fixed head disks. In practice, no customer would buy that many fixed head disks; 
instead, the customer would buy an RP15/RP02 disk pack, which provided five 
times the storage at lower cost. 

Apparently, the maximum RF15 configuration was never tested with the 
PDP-15's DOS-15 operating system. The predecessor operating system, ADSS-15, 
had been limited to 4 RS09 disks. DOS-15 increased this to 8, but the 
configuration was never tested, as Hans Pufal documented in a mail message: 

The OS exits to IOPS error code 21 when it reaches a platter 
number 010. The problem is that with 8 platters there will never be a 
NED indication. I think the problem is in the OS code: 



        75072:  CLA             ; set platter to 0
        75073:  IOT 7045        ; force controller idle, clear done
        75074:  IOT 0           ; padding
        75075:  IOT 0
        75076:  IOT 0
; Top of platter loop
        75077:  IOT 7065        ; set disk platter
        75100:  IOT 0           ; padding
        75101:  IOT 0
        75102:  DSSF            ; skip on error (NED)
        75103:  JMP 75106       ; jump if disk exists
        75104:  DSCD            ; clear status
        75105:  JMP 75113       ; found NED, AC equals number of platters
; Disk exists, inc disk and loop back if not done 8
        75106:  DSCD            ; clear status
        75107:  TAD 75401       ; = 001 ; add 1
        75110:  SAD 75722       ; = 010 ; compare with 010
        75111:  JMP 75231       ; jmp out if disks = 8
        75112:  JMP 75077       ; not 8 so go back for next disk
        75113:  DAC 75072       ; store # of platters
        75114:  SNA CLL         ; skip if AC =
        75115:  JMP 75231       ; jump to IOPS error
; Error path
        75231:  LAC 75131
        75232:  DAC* 75731
        75233:  LAW 21          ; IOPS number
        75234:  JMP 75240       ; go do IOPS error



I think the JMP at 75111 should be a SKP. 

And indeed it should. A maximum RF15 configuration, impractically expensive at 
the time, was never tested. 
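
The control flow of the platter-sizing loop, and the effect of the proposed one-word patch, can be followed in a short Python transcription. This is a sketch of the logic only, not of the RF15 hardware; the function and exception names are hypothetical:

```python
class IOPSError(Exception):
    """Stand-in for the jump to the IOPS error handler."""


def count_platters(installed, patched=False, limit=0o10):
    """Walk the platter loop; 'installed' platters respond, the rest give NED."""
    ac = 0                                # 75072: CLA
    while True:
        if ac >= installed:               # 75102: DSSF skips on NED
            break                         # 75105: JMP 75113, AC = platter count
        ac += 1                           # 75107: TAD, add 1
        if ac == limit:                   # 75110: SAD does not skip when AC = 010
            if patched:
                break                     # SKP at 75111 skips the loop-back JMP
            raise IOPSError(21)           # original JMP 75231
        # 75112: JMP 75077, try the next platter
    if ac == 0:                           # 75114-75115: no platters at all
        raise IOPSError(21)
    return ac                             # 75113: DAC stored the count
```

With the original JMP, `count_platters(7)` returns 7 but `count_platters(8)` raises IOPS error 21; with `patched=True`, eight platters are counted correctly.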

Untestable Configurations 

A simulator can mimic any implementation of a computer architecture. Further, it 
can implement an arbitrary assemblage of peripherals. This flexibility may 
significantly exceed the testing capabilities available to real developers in late 
stage operating systems. For example, the SIMH PDP-11 simulator emulates a 
KDJ11A CPU with a broad set of peripherals ranging from DECtape (out of 
production by the early 70s) to MSCP disks (still current in the early 90s). DEC 
in its heyday would have been hard pressed to assemble such an eclectic set of 
devices. Therefore, it is not surprising that by the late 90's, the skeleton crew 
maintaining the PDP-11 operating systems could no longer test older hardware. 

This problem is evident in the behavior of RSX11M+ V4.5 autoconfigure. V4.2 
correctly identifies the simulator as an LSI-11/73 (KDJ11A CPU). But V4.5 
identifies it as an "M11", Mentec's 1997 re-implementation of the J-11 in gate 
arrays. What happened? 

M+ autoconfigure implements a series of tests that act as a sieve to eliminate 
classes of PDP-11 processors. When the tests are done, one and only one CPU 
model should be flagged. The tests are very fine grained, but the KDJ11A and 
M11 are almost identical. Both respond with MFPT = 5 and maintenance ID = 
20. To distinguish them, the following code was added to autoconfigure in V4.5 
(as disassembled by the simulator): 



                                        ; PDR7 has W bit set
131640: MOV     @#172317,@#172317       ; write odd byte of kernel PDR7
131646: BIT     #100,@#172316           ; is W bit still set?
131654: BEQ     131664                  ; if eq no
131656: BIC     #200,R4                 ; if ne yes, clear J11 bit (ie, it's an M11)
131662: BR      132124
131664: BIC     #20000,R4               ; if eq no, clear M11 bit (ie, it's a J11)
131670: BR      132056



This code sequence cannot work as written. On the KDJ11A, and presumably 
on the M11, the MOV instruction accesses an odd address and traps while 
fetching the source address. The trap handler simply RTI's, and the third word of 
the MOV is executed as an instruction ADDF F3,(PC), which is harmless. 
Because the PDR is not actually written, the W bit isn't cleared, and the CPU is 
always classified as an M11. What is going on? 
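
The reading of the MOV's third word as ADDF F3,(PC) can be checked by decoding the bit fields. A small Python sketch, assuming the standard PDP-11 floating-point layout in which ADDF occupies octal 172000 through 172377 with the accumulator in bits 7-6 and the source specifier in bits 5-0; the function name is hypothetical:

```python
def decode_addf(word):
    """Decode a PDP-11 ADDF word into (accumulator, mode, register), else None."""
    if word & 0o177400 != 0o172000:     # top eight bits select ADDF
        return None
    accumulator = (word >> 6) & 0o3     # floating accumulator number
    mode = (word >> 3) & 0o7            # source addressing mode
    register = word & 0o7               # source register
    return accumulator, mode, register


third_word = 0o172317                   # destination address word of the MOV
fields = decode_addf(third_word)        # accumulator 3, mode 1, register 7
```

Mode 1 with register 7 is register-deferred through the PC, which gives the ADDF F3,(PC) shown by the disassembler.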

The answer comes by comparison with the CPU identification code in routine 
SAVSIZ: 



20$:    MOV     #KISDR7+1,R0    ; POINT TO KERNEL PDR7                  ; DG535
        MOVB    (R0),(R0)       ; WRITE THE HIGH BYTE OF THE PDR        ; DG535
        BITB    #100,-(R0)      ; DOES IT SHOW WRITTEN?                 ; DG535
        BNE     60$             ; IF NE, YES, WE HAVE AN M11            ; DG535



This sequence will work. The MOVB doesn't trap. On a KDJ11A, a write to the 
PDR clears the W bit, even if the PDR is mapping itself. On the M11, apparently, 
it does not. 
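
The difference between the two identification sequences can be modeled in a few lines of Python. This is a sketch under stated assumptions: the class and function names are invented for illustration, and the M11's behavior (a write leaves W set) is the one inferred above, not a documented fact:

```python
W_BIT = 0o100   # bit 6 of the PDR, the bit tested by BIT #100


class OddAddressTrap(Exception):
    """Raised when a word access is attempted at an odd address."""


class KernelPDR7:
    def __init__(self, write_clears_w):
        self.value = W_BIT                     # PDR7 starts with W set
        self.write_clears_w = write_clears_w   # True models a KDJ11A, False an M11

    def write_high_byte(self):
        """Byte write to the odd address, as MOVB (R0),(R0) does."""
        if self.write_clears_w:
            self.value &= ~W_BIT

    def write_word_at_odd_address(self):
        """Word access to the odd address, as the V4.5 MOV attempts."""
        raise OddAddressTrap()


def identify_v45(pdr):
    """Autoconfigure V4.5 sequence: the write never happens."""
    try:
        pdr.write_word_at_odd_address()
    except OddAddressTrap:
        pass                                   # trap handler simply returns
    return "M11" if pdr.value & W_BIT else "J11"


def identify_savsiz(pdr):
    """SAVSIZ sequence: the byte write goes through."""
    pdr.write_high_byte()
    return "M11" if pdr.value & W_BIT else "J11"
```

In this model `identify_v45` reports "M11" for both processors, while `identify_savsiz` reports "J11" for `KernelPDR7(write_clears_w=True)` and "M11" for `KernelPDR7(write_clears_w=False)`.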

How did the bug in autoconfigure go undetected? One possibility is that 
autoconfigure was not tested. But a more compelling hypothesis is that the 
developer simply didn't have a KDJ11A available for testing. The KDJ11A is a 
relatively rare survival as a system processor; most J11-based PDP-11 systems 
were built with the KDJ11B, D, or E processor modules. The developer tried the 
code on an M11, and it worked; he probably didn't have a KDJ11A available to 
see that it didn't. 

Conclusion 



In debugging a simulator, 99% of all problems that occur in bringing up an 
operating system will be the simulator's fault. Occasionally, the problem will be 
in the operating system itself. Operating systems contain timing dependencies 
that simulated devices break, or they may not have been tested against all possible 
hardware configurations. Late stage operating systems suffer from inadequate 
staffing, incomplete test facilities, and other limitations. The result is introduction 
of bugs through coding mistakes or "code rot" (code breakage as a side effect of 
new features). Locating these problems, and tracing them to root causes, is one 
of the most difficult challenges in simulator debugging. 